You will often find yourself in a situation where you will be dealing with an incomplete dataset. There are many reasons why data may be missing: survey responses may have been optional, there may have been some sort of data recording error, or the information may simply just not be available. There are a plethora of ways to handle such situations, several of which we will explore in this exercise.
1 - Go to the Machine Learning Repository and download the Beijing PM2.5 Data, put it into a dataframe, giving the columns the proper names. Also be sure to familiarize yourself with the data set before proceeding.
In [1]:
import pandas as pd
import numpy as np
In [2]:
pm2 = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/00381/PRSA_data_2010.1.1-2014.12.31.csv',
na_values='NA')
pm2.columns = ['id', 'year', 'month', 'day', 'hour', 'pm2', 'dew_point', 'temperature',
'pressure', 'wind_dir', 'wind_speed', 'hours_snow', 'hours_rain']
In [3]:
pm2.head()
Out[3]:
In [4]:
pm2.info()
There are over 2,000 samples with the PM2.5 value missing; since this is the value to predict, I am going to drop them.
In [5]:
pm2.dropna(inplace=True)
In [6]:
pm2.describe().T
Out[6]:
In [7]:
pm2.describe(include=['O'])
Out[7]:
In [8]:
pm2.wind_dir.value_counts()
Out[8]:
2 - Suppose our data became corrupted after we downloaded it and values were missing. Randomly insert 5000 NaN values into the dataset across all the columns.
In [9]:
# setting the seed
np.random.seed(0)
# creating an array of dimension equal to the number of cells of the dataframe and with exactly 5000 ones
dim = pm2.shape[0]*pm2.shape[1]
arr = np.array([0]*(dim-5000) + [1]*5000)
# shuffling and reshaping the array
np.random.shuffle(arr)
arr = arr.reshape(pm2.shape[0], pm2.shape[1])
# looping through all the values and setting the corresponding position in the dataframe to nan
it = np.nditer(arr, flags=['multi_index'])
while not it.finished:
    if it[0] == 1:
        pm2.iloc[it.multi_index[0], it.multi_index[1]] = np.nan
    it.iternext()
In [ ]:
# solution: insert NaNs on all columns at random
# (note: np.random.randint samples with replacement, so each column may end up with slightly fewer than 5000 NaNs)
data_na = pm2.copy()
nrow = data_na.shape[0]
for col in data_na:
    rows = np.random.randint(0, nrow, 5000)
    data_na[col].iloc[rows] = np.nan
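A hedged alternative (not part of the original exercise): the same corruption can be done without an explicit loop by drawing exactly 5000 distinct cell positions and masking them in one shot. The pm2_na name below is just illustrative.
In [ ]:
# vectorized sketch: choose exactly 5000 distinct cell positions and blank them out in one shot
rng = np.random.RandomState(0)
n_cells = pm2.shape[0] * pm2.shape[1]
mask = np.zeros(n_cells, dtype=bool)
mask[rng.choice(n_cells, size=5000, replace=False)] = True   # exactly 5000 distinct cells
pm2_na = pm2.mask(mask.reshape(pm2.shape))                    # True positions become NaN
pm2_na.isna().sum().sum()                                     # 5000, plus any pre-existing NaNs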
In [10]:
pm2.info()
3 - Which variables lend themselves to being used in a regression model? Select those variables, and then fit a regression model for each of the following imputation strategies, commenting on your results.
- Dropping all rows with at least 1 NA
- Dropping all rows with at least 3 NA
- Imputing 0
- Mean
- Median
- Mode
In [11]:
# I'm dropping wind_dir and id
regr_cols = ['year', 'month', 'day', 'hour', 'dew_point', 'temperature',
'pressure', 'wind_speed', 'hours_snow', 'hours_rain', 'pm2']
pm2_regr = pm2.loc[:, regr_cols]
# note: the solution does not include year, month, day and hour;
# it also discards hours_snow and hours_rain (even though they aren't binary or categorical)
In [12]:
# from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Imputer  # removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer instead
from sklearn.linear_model import LinearRegression
In [13]:
lr = LinearRegression()
# X = pm2_regr.iloc[:, :-1]
# y = pm2_regr.iloc[:, -1]
# Xtrain, Xtest, ytrain, ytest = train_test_split(pm2_regr.iloc[:, :-1], pm2_regr.iloc[:, -1], test_size=0.2, random_state=0)
In [14]:
# just a note to self: thresh is the minimum number of non-NaN values required to keep a row,
# so with 11 columns dropna(thresh=11) is equivalent to dropna(), and dropping rows
# with at least 3 NAs corresponds to thresh=9
pm2_regr1 = pm2_regr.dropna(thresh=9)
pm2_regr1.info()
Dropping all rows with at least 1 NA:
In [15]:
lr.fit(pm2_regr.dropna().iloc[:, :-1], pm2_regr.dropna().iloc[:, -1])
lr.score(pm2_regr.dropna().iloc[:, :-1], pm2_regr.dropna().iloc[:, -1])
Out[15]:
Dropping only the rows with at least 3 NAs gives an error, because the rows that are kept can still contain NaNs, which LinearRegression cannot handle:
In [16]:
lr.fit(pm2_regr.dropna(thresh=9).iloc[:, :-1], pm2_regr.dropna(thresh=9).iloc[:, -1])
lr.score(pm2_regr.dropna(thresh=9).iloc[:, :-1], pm2_regr.dropna(thresh=9).iloc[:, -1])
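As an aside, a minimal sketch (my addition, not part of the exercise) of how the thresh-based drop could still be used: impute the few NaNs that survive the drop, here with the column medians, before fitting. The pm2_thresh name is just for illustration.
In [ ]:
# hedged sketch: rows kept by the thresh-based drop can still contain NaNs, so impute them
# (here with each column's median) before fitting; rows with a missing target are dropped outright
pm2_thresh = pm2_regr.dropna(thresh=9).dropna(subset=['pm2'])
pm2_thresh = pm2_thresh.fillna(pm2_thresh.median())
lr.fit(pm2_thresh.iloc[:, :-1], pm2_thresh.iloc[:, -1])
lr.score(pm2_thresh.iloc[:, :-1], pm2_thresh.iloc[:, -1])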
Imputing 0:
In [17]:
lr.fit(pm2_regr.fillna(0).iloc[:, :-1], pm2_regr.fillna(0).iloc[:, -1])
lr.score(pm2_regr.fillna(0).iloc[:, :-1], pm2_regr.fillna(0).iloc[:, -1])
Out[17]:
Imputing the mean:
In [18]:
imp = Imputer(strategy='mean')
pm2_regr_mean = imp.fit_transform(pm2_regr)
In [19]:
lr.fit(pm2_regr_mean[:, :-1], pm2_regr_mean[:, -1])
lr.score(pm2_regr_mean[:, :-1], pm2_regr_mean[:, -1])
Out[19]:
The median:
In [20]:
imp = Imputer(strategy='median')
pm2_regr_median = imp.fit_transform(pm2_regr)
In [21]:
lr.fit(pm2_regr_median[:, :-1], pm2_regr_median[:, -1])
lr.score(pm2_regr_median[:, :-1], pm2_regr_median[:, -1])
Out[21]:
And the mode:
In [22]:
imp = Imputer(strategy='most_frequent')
pm2_regr_mode = imp.fit_transform(pm2_regr)
In [23]:
lr.fit(pm2_regr_mode[:, :-1], pm2_regr_mode[:, -1])
lr.score(pm2_regr_mode[:, :-1], pm2_regr_mode[:, -1])
Out[23]:
The best result comes from simply dropping all rows with NAs; mean and median imputation give similar performance, while the mode is the worst (surprisingly, worse than imputing 0, which is a fairly arbitrary choice). Overall, none of the strategies yields good results; the linear fit seems poor in general.
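One caveat: the R² values above are computed on the same data the models were fit on. A hedged sketch of a fairer comparison, using a held-out split and the newer SimpleImputer API (the variable names are illustrative):
In [ ]:
# compare imputation strategies on a held-out split (sketch, assuming the pm2_regr frame from above)
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

pm2_target_ok = pm2_regr.dropna(subset=['pm2'])    # the target itself cannot be imputed meaningfully
Xtr, Xte, ytr, yte = train_test_split(pm2_target_ok.iloc[:, :-1], pm2_target_ok['pm2'],
                                      test_size=0.2, random_state=0)
strategies = [('zero', SimpleImputer(strategy='constant', fill_value=0)),
              ('mean', SimpleImputer(strategy='mean')),
              ('median', SimpleImputer(strategy='median')),
              ('mode', SimpleImputer(strategy='most_frequent'))]
for name, imputer in strategies:
    Xtr_imp = imputer.fit_transform(Xtr)           # fit the imputer on the training split only
    Xte_imp = imputer.transform(Xte)
    model = LinearRegression().fit(Xtr_imp, ytr)
    print(name, model.score(Xte_imp, yte))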
4 - Given the results in part (3), and your own ingenuity, come up with a new imputation strategy and try it out. Comment on your results.
I'm going to drop rows with NAs in the columns year, month, day, hour, and pm2, and impute the median for all other columns:
In [24]:
pm2_regr_imp = pm2_regr.dropna(subset=['year', 'month', 'day', 'hour', 'pm2'])
imp = Imputer(strategy = 'median')
pm2_regr_imp = imp.fit_transform(pm2_regr_imp)
In [25]:
lr.fit(pm2_regr_imp[:, :-1], pm2_regr_imp[:, -1])
lr.score(pm2_regr_imp[:, :-1], pm2_regr_imp[:, -1])
Out[25]:
The result is slightly better than simply imputing the mean or median, but still worse than dropping all NAs.
Sometimes your data will contain categorical variables, which need to be handled carefully depending on the machine learning algorithm you choose to use. Encoding categorical variables comes in two flavors: ordinal (ordered) and nominal (unordered) features. In this exercise, you'll further explore the Beijing PM2.5 dataset, this time using categorical variables.
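To make the distinction concrete, here is a small hedged sketch on a toy frame (the column names are invented for illustration): an ordinal feature gets an order-respecting integer code, a nominal one gets one-hot indicator columns.
In [ ]:
# toy sketch of the two flavors of categorical encoding
toy = pd.DataFrame({'size': ['small', 'large', 'medium'],   # ordinal: has a natural order
                    'color': ['red', 'green', 'red']})      # nominal: no order
# ordinal: map categories to integers that respect the order
size_order = pd.CategoricalDtype(['small', 'medium', 'large'], ordered=True)
toy['size_code'] = toy['size'].astype(size_order).cat.codes
# nominal: one indicator column per category
toy = pd.get_dummies(toy, columns=['color'])
print(toy)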
1 - Which variables are categorical? Encode them properly, taking care to ensure that they are properly classified as either ordinal or nominal.
There is one categorical variable:
In [26]:
pm2.describe(include=['O'])
Out[26]:
The variable is nominal, so I'm going to use one-hot encoding:
In [27]:
# for simplicity I'm using pandas function
pm2_enc = pd.get_dummies(pm2)
In [28]:
pm2_enc = pm2_enc.loc[:, regr_cols[:-1] + ['wind_dir_NE', 'wind_dir_NW', 'wind_dir_SE', 'wind_dir_cv'] + regr_cols[-1:]].dropna()
In [ ]:
# from solutions using sklearn:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
l_enc = LabelEncoder()
oh_enc = OneHotEncoder(sparse=False)  # in scikit-learn >= 1.2 this argument is named sparse_output
# change categorical data labels to integers
data_sub = pm2.copy()
data_sub.wind_dir = l_enc.fit_transform(data_sub.wind_dir)
# one-hot encode
dummies = pd.DataFrame(oh_enc.fit_transform(data_sub.wind_dir.values.reshape(-1, 1)), columns=l_enc.classes_)
# join with original df
data_sub = data_sub.drop('wind_dir', axis=1)
data_sub = data_sub.join(dummies)
data_sub.head()
2 - Perform a multilinear regression, using the classified data, removing the NA values. Comment on your results.
In [29]:
lr.fit(pm2_enc.iloc[:, :-1], pm2_enc.iloc[:, -1])
lr.score(pm2_enc.iloc[:, :-1], pm2_enc.iloc[:, -1])
Out[29]:
The results are a bit better than before, but the performance is still very poor.
3 - Create a new encoding for days in which it rained, snowed, neither, and both, and then rerun the regression. Are the results any better?
In [30]:
# hours_snow and hours_rain are cumulative counts, so the daily max tells whether it snowed/rained that day
days = pm2_enc.groupby(['year', 'month', 'day'])[['hours_snow', 'hours_rain']].max()
# creating columns for the encodings
days['snow'] = pd.Series(days['hours_snow'] > 0, dtype='int')
days['rain'] = pd.Series(days['hours_rain'] > 0, dtype='int')
days['rain_snow'] = pd.Series((days['hours_rain'] > 0) & (days['hours_snow'] > 0), dtype='int')
days['no_rain_snow'] = pd.Series((days['hours_rain'] == 0) & (days['hours_snow'] == 0), dtype='int')
# resetting index and dropping hours_snow and hours_rain
days.reset_index(inplace=True)
days.drop(['hours_snow', 'hours_rain'], inplace=True, axis=1)
In [31]:
# joining the dataframe with the new columns to the original one
pm2_enc = pm2_enc.merge(days, left_on=['year', 'month', 'day'], right_on=['year', 'month', 'day'])
In [32]:
pm2_enc.info()
In [33]:
lr.fit(pm2_enc.iloc[:, :-1], pm2_enc.iloc[:, -1])
lr.score(pm2_enc.iloc[:, :-1], pm2_enc.iloc[:, -1])
Out[33]:
In [ ]:
Wow, now the fit is perfect! This is too good to be true, though: the merge appended the new indicator columns after pm2, so iloc[:, -1] no longer selects pm2 as the target but no_rain_snow, which is an exact linear combination of the snow, rain and rain_snow columns (no_rain_snow = 1 - snow - rain + rain_snow).
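A quick hedged fix sketch: select the target by name rather than by position, so newly appended columns can never silently become the target.
In [ ]:
# refit with pm2 explicitly as the target (sketch)
features = pm2_enc.columns.drop('pm2')
lr.fit(pm2_enc[features], pm2_enc['pm2'])
lr.score(pm2_enc[features], pm2_enc['pm2'])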
4 - Create a new encoding for the quartile that a day falls under by wind speed and rerun the regression. Comment on your results.
In [34]:
# using pandas cut and subtracting 0.1 to include the min values
pm2_enc['wind_speed_quartile'] = pd.cut(pm2_enc.wind_speed,
bins=list(pm2_enc.wind_speed.quantile([0])-0.1) + list(pm2_enc.wind_speed.quantile([0.25, 0.5, 0.75, 1])),
labels=[0.25, 0.5, 0.75, 1])
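A hedged aside: pandas' qcut does quantile-based binning directly and includes the minimum value, so the manual -0.1 adjustment isn't needed; wind_speed_q below is just an illustrative name.
In [ ]:
# sketch: quartile binning with pd.qcut (pass duplicates='drop' if quantile edges happen to coincide)
wind_speed_q = pd.qcut(pm2_enc.wind_speed, q=4, labels=[0.25, 0.5, 0.75, 1])
wind_speed_q.value_counts()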
In [ ]:
# from solutions: using np.percentile:
quartile = np.percentile(data_sub['wind_speed'], [25, 50, 75, 100])
cat = []
for row in range(len(data_sub)):
    wind_speed = data_sub['wind_speed'].iloc[row]
    # if/elif so each row gets exactly one label
    if wind_speed <= quartile[0]:
        cat.append('1st')
    elif wind_speed <= quartile[1]:
        cat.append('2nd')
    elif wind_speed <= quartile[2]:
        cat.append('3rd')
    else:
        cat.append('4th')
data_sub['wind_quart'] = cat
# and then create dummies...
In [ ]:
# converting the column to numeric
pm2_enc.wind_speed_quartile = pd.to_numeric(pm2_enc.wind_speed_quartile)
In [30]:
lr.fit(pm2_enc.iloc[:, :-1], pm2_enc.iloc[:, -1])
lr.score(pm2_enc.iloc[:, :-1], pm2_enc.iloc[:, -1])
Out[30]:
The score has gone down again after adding this new column. Part of this is likely the same positional-target issue as above (with wind_speed_quartile appended last, iloc[:, -1] now points at it rather than at pm2); beyond that, the coarse binning mostly duplicates information already carried by wind_speed.
5 - Create a new encoding for deciles of the DEWP variable. Then select the row containing the highest temperature, and, using Pandas' category data type, select all rows in a lesser DEWP decile than this row.
In [31]:
# using pandas cut and subtracting 0.1 to include the min values
pm2_enc['dew_point_decile'] = pd.cut(pm2_enc.dew_point,
bins=list(pm2_enc.dew_point.quantile([0])-0.1) + list(pm2_enc.dew_point.quantile([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])))
In [ ]:
# from solutions: instead of cut, a dew_dec column is created and then converted to an ordered categorical:
data_sub.dew_dec = pd.Categorical(data_sub.dew_dec, categories=data_sub.dew_dec.unique(), ordered=True)
In [32]:
decile = pm2_enc.iloc[pm2_enc.temperature.argmax()].dew_point_decile
print(decile)
pm2_enc.loc[pm2_enc.dew_point_decile < decile]
Out[32]:
1 - Head over to the Machine Learning Repository, download the Wine Dataset, and put it in a dataframe, being sure to label the columns properly.
In [33]:
wine = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
wine.columns = ['class', 'alcohol', 'malic_acid', 'ash', 'alcalinity_ash', 'magnesium', 'total_phenols',
'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity',
'hue', 'OD280_OD315', 'proline']
In [34]:
wine.head()
Out[34]:
2 - Fit a Nearest Neighbors model to the data, using a normalized data set, a standardized data set, and the original. Split into test and train sets, compute the accuracy of the classifications, and comment on your results.
In [35]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
Original dataset:
In [36]:
Xtrain, Xtest, ytrain, ytest = train_test_split(wine.iloc[:, 1:], wine.iloc[:, 0], test_size=0.3, random_state=0)
knn = KNeighborsClassifier()
knn.fit(Xtrain, ytrain)
print(knn.score(Xtrain, ytrain))
print(knn.score(Xtest, ytest))
Normalized dataset:
In [37]:
mms = MinMaxScaler()
Xtrain_norm = mms.fit_transform(Xtrain)
Xtest_norm = mms.transform(Xtest)
knn.fit(Xtrain_norm, ytrain)
print(knn.score(Xtrain_norm, ytrain))
print(knn.score(Xtest_norm, ytest))
Standardized dataset:
In [38]:
ssc = StandardScaler()
Xtrain_std = ssc.fit_transform(Xtrain)
Xtest_std = ssc.transform(Xtest)
knn.fit(Xtrain_std, ytrain)
print(knn.score(Xtrain_std, ytrain))
print(knn.score(Xtest_std, ytest))
The accuracy is far better with the normalized or standardized dataset, the latter generalizing slightly better: K-Nearest Neighbors relies on distances between samples, so it is sensitive to feature scaling.
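A hedged aside: wrapping the scaler and the classifier in a scikit-learn Pipeline guarantees the scaler is fit on the training split only and keeps the comparison tidy; this is just one way to structure it.
In [ ]:
# sketch: bundle scaling and KNN so the scaler is always fit on the training data only
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

knn_std = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_std.fit(Xtrain, ytrain)
print(knn_std.score(Xtest, ytest))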
3 - Fit a Naive Bayes model to the data, using a normalized data set, a standardized data set, and the original. Comment on your results.
In [39]:
from sklearn.naive_bayes import GaussianNB
Original dataset:
In [40]:
gnb = GaussianNB()
gnb.fit(Xtrain, ytrain)
print(gnb.score(Xtrain, ytrain))
print(gnb.score(Xtest, ytest))
Normalized dataset:
In [41]:
gnb.fit(Xtrain_norm, ytrain)
print(gnb.score(Xtrain_norm, ytrain))
print(gnb.score(Xtest_norm, ytest))
Standardized dataset:
In [42]:
gnb.fit(Xtrain_std, ytrain)
print(gnb.score(Xtrain_std, ytrain))
print(gnb.score(Xtest_std, ytest))
For this algorithm there is no difference at all: Gaussian Naive Bayes estimates a per-feature mean and variance, so rescaling a feature simply rescales those estimates without changing the predicted classes, and scaling the data isn't necessary.
With many datasets, you will find yourself in a situation where not all of the provided features are relevant to your model and it may be best to discard them. This is a very complex topic, involving many techniques, a few of which we will explore in this exercise, using the Boston housing data.
1 - From sklearn import the load_boston package, and put the data into a data frame with the proper column names, and then split into training and testing sets.
In [2]:
from sklearn.datasets import load_boston  # note: load_boston has been removed from recent scikit-learn releases
boston = load_boston()
In [3]:
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_target = boston.target
In [4]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(boston_df, boston_target, test_size=0.3, random_state=0)
2 - Fit a series of least squares multilinear regression models to the data, and use the F-statistic to select the k best features, for values of k ranging from 1 to the total number of features. Plot the MSE for each model against the test set and print the best features for each iteration. Comment on your results.
In [5]:
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.metrics import mean_squared_error
In [7]:
# in the solutions he uses f_regression and not f_classif
# also, best features are obtained by cols[sel.get_support()] with cols = Xtrain.columns
# and lr is instantiated with normalize=True
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression
mse = []
cols = Xtrain.columns
lr = LinearRegression(normalize=True)  # note: the normalize argument has been removed in newer scikit-learn versions
# looping through the number of features desired and storing the results in mse
for k in range(1, boston_df.shape[1]+1):
    # using SelectKBest with the F-statistic as the score
    sel = SelectKBest(score_func=f_regression, k=k)
    # fitting the selector
    sel.fit(Xtrain, ytrain)
    # transforming train and test sets
    Xtrain_k = sel.transform(Xtrain)
    Xtest_k = sel.transform(Xtest)
    # fitting linear regression model and printing out the k best features
    lr.fit(Xtrain_k, ytrain)
    print('Top {} features {}'.format(sel.k, cols[sel.get_support()]))
    mse.append(mean_squared_error(lr.predict(Xtest_k), ytest))
In [8]:
mse
Out[8]:
In [54]:
mse = []
# looping through the number of features desired and storing the results in mse
for k in range(1, boston_df.shape[1]+1):
    # using SelectKBest with the F-statistic as the score
    sel = SelectKBest(score_func=f_classif, k=k)
    # fitting the selector
    sel.fit(Xtrain, ytrain)
    # transforming train and test sets
    Xtrain_k = sel.transform(Xtrain)
    Xtest_k = sel.transform(Xtest)
    # fitting linear regression model and printing out the k best features
    lr.fit(Xtrain_k, ytrain)
    print('Top {} features {}'.format(k, pd.Series(sel.scores_, index=Xtrain.columns).\
                                         sort_values(ascending=False).\
                                         head(k).index.values))
    mse.append(mean_squared_error(lr.predict(Xtest_k), ytest))
In [48]:
import matplotlib.pyplot as plt
%matplotlib inline
In [55]:
plt.figure(figsize=(16, 8))
plt.plot(range(1, len(mse)+1), mse)
plt.title('MSE for models with different number of features')
plt.xlabel('Number of Features')
plt.ylabel('MSE');
The MSE keeps going down as features are added; there is a large gain once the 11th feature is included.
3 - Do the same as in part (2), this time using recursive feature selection.
In [50]:
from sklearn.feature_selection import RFE
In [51]:
mse = []
# looping through the number of features desired and storing the results in mse
for k in range(1, boston_df.shape[1]+1):
    # using Recursive Feature Elimination with linear regression as the estimator
    sel = RFE(estimator=lr, n_features_to_select=k)
    # fitting the selector
    sel.fit(Xtrain, ytrain)
    # transforming train and test sets
    Xtrain_k = sel.transform(Xtrain)
    Xtest_k = sel.transform(Xtest)
    # fitting linear regression model and printing out the k best features
    lr.fit(Xtrain_k, ytrain)
    print('Top {} features {}'.format(k, pd.Series(sel.support_, index=Xtrain.columns).\
                                         sort_values(ascending=False).\
                                         head(k).index.values))
    mse.append(mean_squared_error(lr.predict(Xtest_k), ytest))
In [52]:
plt.figure(figsize=(16, 8))
plt.plot(range(1, len(mse)+1), mse)
plt.title('MSE for models with different number of features')
plt.xlabel('Number of Features')
plt.ylabel('MSE');
The MSE keeps going down as features are added, but after the sixth feature there isn't much improvement.
4 - Fit a Ridge Regression model to the data and use recursive feature elimination and SelectFromModel in sklearn to select your features. Generate the same plots and best features as in parts (2) and (3), and comment on and compare your results to what you have found previously.
In [ ]:
# note: the solutions both repeat the previous exercise with Ridge and use SelectFromModel
In [21]:
# in selectfrommodel it uses c_vals = np.arange(0.1, 2.1, 0.1) to loop through and the threshold is set to
# str(c) + '*mean' for c in c_vals
# also, he always fits the ridge model
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectFromModel
# fitting ridge regression
ridge = Ridge()
c_vals = np.arange(0.1, 2.1, 0.1)
cols = Xtrain.columns
mse = []
# looping through the possible thresholds from above and storing the results in mse
for c in c_vals:
    # using SelectFromModel with the ridge coefficients as importances
    selfrmod = SelectFromModel(ridge, threshold=str(c) + '*mean')
    # fitting the selector
    selfrmod.fit(Xtrain, ytrain)
    # transforming train and test sets
    Xtrain_k = selfrmod.transform(Xtrain)
    Xtest_k = selfrmod.transform(Xtest)
    # fitting the ridge model on the selected features and printing them out
    ridge.fit(Xtrain_k, ytrain)
    print('c={} features {}'.format(c, cols[selfrmod.get_support()]))
    mse.append(mean_squared_error(ridge.predict(Xtest_k), ytest))
In [22]:
mse
Out[22]:
In [23]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(16, 8))
plt.plot(c_vals, mse)
plt.title('MSE for different thresholds')
plt.xlabel('c')
plt.ylabel('MSE');
In [56]:
from sklearn.linear_model import Ridge
# from sklearn.feature_selection import RFECV
from sklearn.feature_selection import SelectFromModel
# fitting ridge regression
ridge = Ridge()
ridge.fit(Xtrain, ytrain)
# storing features importance
coef = ridge.coef_
mse = []
# looping through the possible thresholds from above and storing the results in mse
for k, thresh in enumerate(sorted(coef, reverse=True)):
    # using SelectFromModel with the ridge coefficients from above
    selfrmod = SelectFromModel(ridge, threshold=thresh)
    # fitting the selector
    selfrmod.fit(Xtrain, ytrain)
    # transforming train and test sets
    Xtrain_k = selfrmod.transform(Xtrain)
    Xtest_k = selfrmod.transform(Xtest)
    # fitting linear regression model and printing out the k best features
    lr.fit(Xtrain_k, ytrain)
    print('Top {} features {}'.format(k+1, pd.Series(ridge.coef_, index=Xtrain.columns).\
                                           sort_values(ascending=False).\
                                           head(k+1).index.values))
    mse.append(mean_squared_error(lr.predict(Xtest_k), ytest))
In [57]:
plt.figure(figsize=(16, 8))
plt.plot(range(1, len(mse)+1), mse)
plt.title('MSE for models with different number of features')
plt.xlabel('Number of Features')
plt.ylabel('MSE');
After the fourth feature there is no further improvement. Also, the MSE seems better than in all the previous trials.
5 - L1 regularization can also be used for model selection. Choose an algorithm in sklearn and repeat part (4) using model selection via regularization.
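Before the SelectFromModel-based approach below, here is a minimal hedged sketch of the most direct form of L1 selection: fit a lasso and keep the features whose coefficients are not shrunk to zero (the lasso_sel and lr_l1 names are just illustrative).
In [ ]:
# sketch: direct L1-based selection on the Boston training split
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.metrics import mean_squared_error

lasso_sel = LassoCV(cv=5).fit(Xtrain, ytrain)
selected = Xtrain.columns[lasso_sel.coef_ != 0]      # features the lasso did not zero out
print('features kept by the lasso:', list(selected))
lr_l1 = LinearRegression().fit(Xtrain[selected], ytrain)
print('test MSE:', mean_squared_error(ytest, lr_l1.predict(Xtest[selected])))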
In [24]:
# again, in solutions he uses the c_vals as before and he fits the lasso
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel
# fitting lasso regression (LassoCV picks the regularization strength by cross-validation)
lasso = LassoCV()
c_vals = np.arange(0.1, 2.1, 0.1)
cols = Xtrain.columns
mse = []
# looping through the possible thresholds from above and storing the results in mse
for c in c_vals:
    # using SelectFromModel with the lasso coefficients as importances
    selfrmod = SelectFromModel(lasso, threshold=str(c) + '*mean')
    # fitting the selector
    selfrmod.fit(Xtrain, ytrain)
    # transforming train and test sets
    Xtrain_k = selfrmod.transform(Xtrain)
    Xtest_k = selfrmod.transform(Xtest)
    # fitting the lasso model on the selected features and printing them out
    lasso.fit(Xtrain_k, ytrain)
    print('c={} features {}'.format(c, cols[selfrmod.get_support()]))
    mse.append(mean_squared_error(lasso.predict(Xtest_k), ytest))
In [25]:
mse
Out[25]:
In [26]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(16, 8))
plt.plot(c_vals, mse)
plt.title('MSE for different thresholds')
plt.xlabel('c')
plt.ylabel('MSE');
In [58]:
from sklearn.linear_model import LassoCV
# fitting lasso regression
lasso = LassoCV()
lasso.fit(Xtrain, ytrain)
# storing features importance
coef = lasso.coef_
mse = []
# looping through the possible thresholds from above and storing the results in mse
for k, thresh in enumerate(sorted(coef, reverse=True)):
    # using SelectFromModel with the lasso coefficients from above
    selfrmod = SelectFromModel(lasso, threshold=thresh)
    # fitting the selector
    selfrmod.fit(Xtrain, ytrain)
    # transforming train and test sets
    Xtrain_k = selfrmod.transform(Xtrain)
    Xtest_k = selfrmod.transform(Xtest)
    # fitting linear regression model and printing out the k best features
    lr.fit(Xtrain_k, ytrain)
    print('Top {} features {}'.format(k+1, pd.Series(lasso.coef_, index=Xtrain.columns).\
                                           sort_values(ascending=False).\
                                           head(k+1).index.values))
    mse.append(mean_squared_error(lr.predict(Xtest_k), ytest))
In [59]:
plt.figure(figsize=(16, 8))
plt.plot(range(1, len(mse)+1), mse)
plt.title('MSE for models with different number of features')
plt.xlabel('Number of Features')
plt.ylabel('MSE');
The performance is similar to before, but this time the improvement in MSE stops at 6 features instead of 4.
It should also be noted that the different methods selected different subsets of features as optimal!